             date symbol pvm_0001 sentiment_0001 onchain_0001 target_neutral target_updown
           <Date> <char>    <int>          <int>        <int>          <num>         <num>
    1: 2018-04-29    ADA       96              1            1      0.4102564  -0.041188088
    2: 2018-04-29   AGIX       28             29            2      0.4358974  -0.040989237
    3: 2018-04-29    BAT       48             12           51      0.3333333  -0.059565759
    4: 2018-04-29    BCH       33             30          100      0.9743590   0.225918211
    5: 2018-04-29    BTC       36              0            0      0.7435897   0.025025799
   ---
94621: 2022-10-30   ZORA       91             50           85      0.9228571   0.171695311
94622: 2022-10-30    ZRX       62             50            5      0.6071429   0.019647720
94623: 2022-10-30    ZYN        4             50           51      0.8485714   0.092574865
94624: 2022-10-30  eRSDL       91             50           89      0.1614286  -0.062505238
94625: 2022-10-30  stETH       81             50           51      0.3814286  -0.009704795
Report: YIEDL Experiment 2 (YIEDL-Numerai Dataset)
Version: 0.1
1 Introduction
For the YIEDL competition we distribute a daily dataset, with week-to-week targets, which has fewer features than the new dataset given to Numerai for their crypto competition. We should compare the performance of the two datasets under a variety of models and check whether it makes sense to push the adoption of the new dataset in our competitions.
Experiment two is, in many ways, similar to experiment one. The main focus here is on the Full Daily Data (i.e. the dataset for Numerai Crypto with 3000+ features). In addition to making out-of-bag predictions on weekly test data only, this experiment also covered the predictions on daily test data.
1.1 Results Overview (Long Story Short)
If you’re swamped with deadlines and your coffee’s ice-cold, here’s the straight-to-the-point rundown on the results and conclusion to keep you informed without the fuss!
Daily models do not improve out-of-bag predictive performance on weekly neutral targets, indicating the Full Daily Data might not be suitable for the classic YIEDL Neutral competition in the current weekly tournament format. ❌
Daily models do show better out-of-bag predictive performance on daily neutral targets, indicating the Full Daily Data is suitable for the daily submission (i.e. the current Numerai Crypto format). ✔️
Daily models do show better out-of-bag predictive performance on weekly updown targets, indicating the Full Daily Data is suitable for the classic YIEDL Updown competition in the current weekly tournament format. ✔️
Daily models do show better out-of-bag predictive performance on daily updown targets. Yet, this might not be helpful for Numerai Crypto as their targets are similar to the normalised neutral targets. 🤷
See conclusions below for more details.
2 Experiment Set-up
2.1 Datasets
The following three datasets from https://yiedl.ai/competition/datasets were used:
- Targets from YIEDL Weekly Data - dataset_weekly_2025_15.zip
- Targets from YIEDL Daily Data - dataset_daily_2025_15.zip
- Features from Full (aka YIEDL-Numerai) Daily Data - dataset_historical_20250401.zip
2.2 Training vs. Test Periods
- Training: 2018-04-27 to 2022-10-31
- Embargo: 2022-11-01 to 2022-12-31 (a two-month gap between training and test to avoid data leakage)
- Test: 2023-01-01 to 2025-03-31
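The split above can be sketched in a few lines of base R. This is only an illustration: `assign_period` and the toy rows are hypothetical and not part of the experiment code; only the boundary dates come from the periods listed above.

```r
# Hypothetical helper: label each row as training, embargo, or test by date.
# The boundary dates are the ones listed in section 2.2.
assign_period <- function(date) {
  date <- as.Date(date)
  period <- rep(NA_character_, length(date))
  period[date >= as.Date("2018-04-27") & date <= as.Date("2022-10-31")] <- "training"
  period[date >= as.Date("2022-11-01") & date <= as.Date("2022-12-31")] <- "embargo"
  period[date >= as.Date("2023-01-01") & date <= as.Date("2025-03-31")] <- "test"
  period
}

# Toy rows (made up for illustration, not real data)
toy <- data.frame(
  date   = as.Date(c("2018-04-29", "2022-10-30", "2022-11-15",
                     "2023-01-01", "2025-03-30")),
  symbol = c("ADA", "ZRX", "BTC", "AAVE", "BTC")
)
toy$period <- assign_period(toy$date)

# Rows in the embargo window are dropped from both sets
train <- toy[which(toy$period == "training"), ]
test  <- toy[which(toy$period == "test"), ]
```

Dropping the embargo rows entirely, rather than assigning them to either side, is what keeps the two-month gap between training and test.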
2.3 Stats
- Training (Weekly) = 94625 samples from 2018-04-29 to 2022-10-30.
- Training (Daily) = 660847 samples from 2018-04-27 to 2022-10-31.
- Test (Weekly) = 142039 samples from 2023-01-01 to 2025-03-30.
- Test (Daily) = 988574 samples from 2023-01-01 to 2025-03-31.
2.4 Features and Targets
There are 3669 features and 2 targets (target_neutral and target_updown) in the datasets. An excerpt of the weekly training data (three of the features plus both targets) is shown at the top of this report.
2.5 Grid Search
The following parameters were used to build xgboost regression models:
# Define parameters for xgboost
params <- list(objective = "reg:squarederror",                   # fixed
               eta = 0.01,                                       # fixed
               max_bin = 63,                                     # fixed
               tree_method = "gpu_hist",                         # fixed
               gpu_id = 0,                                       # fixed
               booster = "gbtree",                               # fixed
               max_depth = input_params$max_depth,               # see below
               max_leaves = 2^input_params$max_depth - 1,        # see below
               subsample = input_params$subsample,               # see below
               colsample_bytree = input_params$colsample_bytree) # see below
# Train xgboost model
model_weekly <- xgb.train(params = params,                # as shown above
                          data = training_dataset,        # training dataset for each target
                          nrounds = input_params$nrounds) # see below

The dynamic variables for the grid search:
- max_depth: 3, 4, 5, 6, 7
- max_leaves: calculated using max_depth
- subsample: 0.5, 0.75, 1
- colsample_bytree: 0.05, 0.1, 0.2, 0.3, 0.4
- round (passed as nrounds): 500, 1000, 1500, 2000
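As a quick sanity check, the grid above can be enumerated with `expand.grid`, which confirms that the listed values give 5 x 3 x 5 x 4 = 300 combinations. This is a minimal sketch; the object name `grid` and the way `input_params` is pulled from a row are assumptions for illustration, not the original experiment code.

```r
# Enumerate every combination of the dynamic variables listed above
grid <- expand.grid(
  max_depth        = c(3, 4, 5, 6, 7),
  subsample        = c(0.5, 0.75, 1),
  colsample_bytree = c(0.05, 0.1, 0.2, 0.3, 0.4),
  nrounds          = c(500, 1000, 1500, 2000)
)

# max_leaves is derived from max_depth, as in the params list
grid$max_leaves <- 2^grid$max_depth - 1

nrow(grid)  # 300

# Assumed usage: one row of the grid supplies input_params for one model
input_params <- as.list(grid[1, ])
```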
2.6 Models
The models can be categorised into eight groups:
- 300 models trained with weekly data + target_neutral -> predict on weekly data
- 300 models trained with daily data + target_neutral -> predict on weekly data
- 300 models trained with weekly data + target_neutral -> predict on daily data
- 300 models trained with daily data + target_neutral -> predict on daily data
- 300 models trained with weekly data + target_updown -> predict on weekly data
- 300 models trained with daily data + target_updown -> predict on weekly data
- 300 models trained with weekly data + target_updown -> predict on daily data
- 300 models trained with daily data + target_updown -> predict on daily data
Note: each model is a simple average ensemble from three runs with three different random seeds.
Note 2: the complexity of the experiment has increased significantly due to the larger number of features (from 1140 to 3669) and model groups (from 4 to 8), so I had to reduce the size of the grid search (from 1008 to 300) in order to complete a reasonable number of runs within a few days.
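The simple average ensemble mentioned in the first note amounts to a row-wise mean over the per-seed predictions. A minimal sketch, assuming the three runs are stored as columns of a matrix (the numbers and the name `preds_by_seed` are made up for illustration):

```r
# Made-up predictions for three symbols from three runs (one column per seed)
preds_by_seed <- cbind(
  seed_1 = c(0.10, 0.60, 0.80),
  seed_2 = c(0.16, 0.66, 0.83),
  seed_3 = c(0.13, 0.63, 0.86)
)

# Simple average ensemble: equal-weight mean across the three seeds
yhat <- rowMeans(preds_by_seed)
```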
3 Predictions
Here is an example of predictions from models trained with the neutral targets:
date symbol yhat_weekly yhat_daily
<Date> <char> <num> <num>
1: 2023-01-01 0xBTC 0.14035088 0.11336032
2: 2023-01-01 1ECO 0.60863698 0.68690958
3: 2023-01-01 1INCH 0.83265857 0.79082321
4: 2023-01-01 1WO 0.92712551 0.95411606
5: 2023-01-01 AAC 0.03643725 0.05668016
6: 2023-01-01 AAVE 0.76518219 0.85425101
Similarly, we can look at the predictions from models trained with the updown targets:
date symbol yhat_weekly yhat_daily
<Date> <char> <num> <num>
1: 2023-01-01 0xBTC -0.04275151 -0.01003887
2: 2023-01-01 1ECO 0.05094402 0.09078478
3: 2023-01-01 1INCH -0.00897382 -0.00142727
4: 2023-01-01 1WO -0.03885654 -0.01133472
5: 2023-01-01 AAC -0.00622269 0.03552798
6: 2023-01-01 AAVE -0.00843999 -0.00597358
4 Evaluation Metrics
I am skipping the full explanations here as the metrics are exactly the same as the ones used in experiment one. Please see report one for more details.
- Primary metrics: Spearman correlation, RMSE
- Secondary metrics: Sharpe ratio, max drawdown, compound return, and trimmed mean.
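For readers without report one at hand, here is a rough sketch of the two primary metrics. It assumes that the mean Spearman correlation is computed cross-sectionally per date and then averaged over dates, and that RMSE is the usual root mean squared error; the exact definitions are in report one, and the column names `yhat` and `target` are hypothetical.

```r
# Mean of per-date Spearman correlations between predictions and targets
# (assumption: one cross-sectional correlation per date, then averaged)
mean_spearman <- function(df) {
  by_date <- split(df, df$date)
  cors <- vapply(
    by_date,
    function(d) cor(d$yhat, d$target, method = "spearman"),
    numeric(1)
  )
  mean(cors)
}

# Root mean squared error
rmse <- function(yhat, target) sqrt(mean((yhat - target)^2))

# Toy data (made up): perfect ranking on the first date, reversed on the second
toy <- data.frame(
  date   = as.Date(rep(c("2023-01-01", "2023-01-08"), each = 4)),
  yhat   = rep(c(0.1, 0.2, 0.3, 0.4), times = 2),
  target = c(1, 2, 3, 4, 4, 3, 2, 1)
)
toy_cor <- mean_spearman(toy)
```

In the toy data the per-date correlations are +1 and -1, so they cancel out in the mean.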
5 Report Structure for Evaluation Results
Note: in experiment two, we also look at the out-of-bag performance on daily test data (from 2023-01-01). Therefore, we have two different model groups in this report:
- X-to-Weekly means both weekly and daily models are evaluated using the same weekly test data.
- X-to-Daily means both weekly and daily models are evaluated using the same daily test data.
In order to simplify things, the results for each evaluation metric are presented in the following structure:
- Metric
- Group One (X-to-Weekly)
- Group Two (X-to-Daily)
You can use the table of contents to go through the following sections:
- Mean Spearman Correlation (Groups One and Two)
- Sharpe Ratio (Groups One and Two)
- Max Drawdown (Groups One and Two) (omitted for now as some models have zero drawdown - not useful for comparison)
- Compound Return (Groups One and Two)
- Trimmed Mean RMSE (Groups One and Two)
The hypothesis / expectations are the same as those in experiment one so I am skipping them in this report.
6 Mean Spearman Correlation (Target Neutral)
6.1 Group One (X-to-Weekly)
6.1.1 Observations (Stats)
- No. of daily models with higher mean correlation = 15 out of 300 (5%) ❌
- Range of weekly-to-weekly models’ mean correlation (cor_w_w):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1617 0.1699 0.1727 0.1718 0.1740 0.1758
- Range of daily-to-weekly models’ mean correlation (cor_d_w):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1601 0.1678 0.1700 0.1692 0.1715 0.1737
- Range of raw performance differences (cor_d_w - cor_w_w):
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.006481 -0.003803 -0.002387 -0.002625 -0.001400 0.001333
- Range of percentage differences (%) (diff / cor_w_w):
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.7811 -2.1805 -1.3763 -1.5247 -0.8179 0.7963
6.1.2 Observations (Charts)
6.1.3 Result Table
Notes:
- depth = max_depth
- rsamp = subsample
- csamp = colsample_bytree
- round = round
- cor_w_w = mean correlation of weekly-to-weekly models’ predictions
- cor_d_w = mean correlation of daily-to-weekly models’ predictions
- diff = cor_d_w - cor_w_w (i.e. positive differences mean the daily models are better)
- p_diff = diff / cor_w_w * 100, percentage difference (%)
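The `diff` and `p_diff` columns, and the count of daily models that beat their weekly counterparts, follow directly from these definitions. A small sketch with made-up values (the `results` data frame is hypothetical and contains three rows instead of 300):

```r
# Made-up paired results: one row per parameter combination
results <- data.frame(
  cor_w_w = c(0.1700, 0.1720, 0.1750),
  cor_d_w = c(0.1690, 0.1725, 0.1710)
)

# Raw and percentage differences, as defined in the notes above
results$diff   <- results$cor_d_w - results$cor_w_w
results$p_diff <- results$diff / results$cor_w_w * 100

# Number and share of pairs where the daily model is better
n_better   <- sum(results$diff > 0)
pct_better <- n_better / nrow(results) * 100

# Six-number summary, in the same layout as the stats shown in this report
summary(results$p_diff)
```

The same pattern applies to the other metrics. For RMSE, lower is better, so the comparison flips to `diff < 0`.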
6.2 Group Two (X-to-Daily)
6.2.1 Observations (Stats)
- No. of daily models with higher mean correlation = 217 out of 300 (72.3%) ✔️
- Range of weekly-to-daily models’ mean correlation (cor_w_d):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1496 0.1569 0.1590 0.1584 0.1603 0.1622
- Range of daily-to-daily models’ mean correlation (cor_d_d):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.1501 0.1577 0.1598 0.1592 0.1615 0.1637
- Range of raw performance differences (cor_d_d - cor_w_d):
Min. 1st Qu. Median Mean 3rd Qu. Max.
-0.0022650 -0.0001138 0.0008720 0.0008062 0.0016975 0.0036830
- Range of percentage differences (%) (diff / cor_w_d):
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.47744 -0.07244 0.54741 0.50665 1.07789 2.39334
6.2.2 Observations (Charts)
6.2.3 Result Table
Notes:
- depth = max_depth
- rsamp = subsample
- csamp = colsample_bytree
- round = round
- cor_w_d = mean correlation of weekly-to-daily models’ predictions
- cor_d_d = mean correlation of daily-to-daily models’ predictions
- diff = cor_d_d - cor_w_d (i.e. positive differences mean the daily models are better)
- p_diff = diff / cor_w_d * 100, percentage difference (%)
8 Compound Return (Target Neutral)
8.1 Group One (X-to-Weekly)
8.1.1 Observations (Stats)
- No. of daily models with higher compound return = 16 out of 300 (5.3%) ❌
- Range of weekly-to-weekly models’ compound return (return_w_w):
Min. 1st Qu. Median Mean 3rd Qu. Max.
561.3 627.7 651.4 644.7 663.6 679.4
- Range of daily-to-weekly models’ compound return (return_d_w):
Min. 1st Qu. Median Mean 3rd Qu. Max.
549.5 610.2 629.0 622.5 641.8 660.3
- Range of raw performance differences (return_d_w - return_w_w):
Min. 1st Qu. Median Mean 3rd Qu. Max.
-55.05 -32.99 -20.41 -22.20 -11.70 11.04
- Range of percentage differences (%) (diff / return_w_w):
Min. 1st Qu. Median Mean 3rd Qu. Max.
-8.320 -4.960 -3.121 -3.426 -1.840 1.819
8.1.2 Observations (Charts)
8.1.3 Result Table
Notes:
- depth = max_depth
- rsamp = subsample
- csamp = colsample_bytree
- round = round
- return_w_w = compound return of weekly-to-weekly models’ predictions
- return_d_w = compound return of daily-to-weekly models’ predictions
- diff = return_d_w - return_w_w (i.e. positive differences mean the daily models are better)
- p_diff = diff / return_w_w * 100, percentage difference (%)
8.2 Group Two (X-to-Daily)
8.2.1 Observations (Stats)
- No. of daily models with higher compound return = 217 out of 300 (72.3%) ✔️
- Range of weekly-to-daily models’ compound return (return_w_d):
Min. 1st Qu. Median Mean 3rd Qu. Max.
476.5 528.3 543.7 539.4 554.0 568.2
- Range of daily-to-daily models’ compound return (return_d_d):
Min. 1st Qu. Median Mean 3rd Qu. Max.
480.3 534.1 550.0 545.6 563.2 580.1
- Range of raw performance differences (return_d_d - return_w_d):
Min. 1st Qu. Median Mean 3rd Qu. Max.
-15.7902 -0.8395 6.6565 6.1861 13.0039 26.7853
- Range of percentage differences (%) (diff / return_w_d):
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.1433 -0.1595 1.2135 1.1352 2.3816 5.2715
8.2.2 Observations (Charts)
8.2.3 Result Table
Notes:
- depth = max_depth
- rsamp = subsample
- csamp = colsample_bytree
- round = round
- return_w_d = compound return of weekly-to-daily models’ predictions
- return_d_d = compound return of daily-to-daily models’ predictions
- diff = return_d_d - return_w_d (i.e. positive differences mean the daily models are better)
- p_diff = diff / return_w_d * 100, percentage difference (%)
9 Trimmed Mean RMSE (Target Updown)
9.1 Group One (X-to-Weekly)
9.1.1 Observations (Stats)
- No. of daily models with lower trimmed RMSE = 300 out of 300 (100%) ✔️
- Range of weekly-to-weekly models’ trimmed mean RMSE (rmse_w_w):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.5199 0.6167 0.7073 0.7670 0.8039 2.4007
- Range of daily-to-weekly models’ trimmed mean RMSE (rmse_d_w):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.4636 0.5359 0.5824 0.5854 0.6266 0.7796
- Range of raw performance differences (rmse_d_w - rmse_w_w):
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.79362 -0.17714 -0.10102 -0.18162 -0.06290 -0.01259
- Range of percentage differences (%) (diff / rmse_w_w):
Min. 1st Qu. Median Mean 3rd Qu. Max.
-75.292 -24.176 -14.416 -19.009 -10.378 -1.656
9.1.2 Observations (Charts)
9.1.3 Result Table
Notes:
- depth = max_depth
- rsamp = subsample
- csamp = colsample_bytree
- round = round
- rmse_w_w = trimmed mean RMSE of weekly-to-weekly models’ predictions
- rmse_d_w = trimmed mean RMSE of daily-to-weekly models’ predictions
- diff = rmse_d_w - rmse_w_w (i.e. negative differences mean the daily models are better)
- p_diff = diff / rmse_w_w * 100, percentage difference (%)
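The exact definition of the trimmed mean RMSE is given in report one. As a rough illustration only, one plausible reading is an RMSE per test date followed by a trimmed mean across dates, so that a handful of extreme dates do not dominate the score. The function name, the column names, and the trim fraction of 0.1 below are all assumptions, not the values used in the experiment.

```r
# Assumed reading: RMSE per date, then a trimmed mean across dates.
# trim = 0.1 is an arbitrary illustrative choice.
trimmed_mean_rmse <- function(df, trim = 0.1) {
  per_date <- vapply(
    split(df, df$date),
    function(d) sqrt(mean((d$yhat - d$target)^2)),
    numeric(1)
  )
  mean(per_date, trim = trim)
}

# Toy data (made up): ten weekly dates, one symbol each, one extreme error
toy <- data.frame(
  date   = as.Date("2023-01-01") + 7 * (0:9),
  yhat   = c(1:9, 100),
  target = 0
)
toy_rmse <- trimmed_mean_rmse(toy)  # the smallest and largest dates are dropped
```

With ten dates and `trim = 0.1`, the smallest and largest per-date values are discarded before averaging, which is why the outlier of 100 does not affect the result.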
9.2 Group Two (X-to-Daily)
9.2.1 Observations (Stats)
- No. of daily models with lower trimmed RMSE = 299 out of 300 (99.7%) ✔️
- Range of weekly-to-daily models’ trimmed mean RMSE (rmse_w_d):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.5275 0.6303 0.7185 0.7799 0.8204 2.4186
- Range of daily-to-daily models’ trimmed mean RMSE (rmse_d_d):
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.4668 0.5515 0.5976 0.6023 0.6478 0.8049
- Range of raw performance differences (rmse_d_d - rmse_w_d):
Min. 1st Qu. Median Mean 3rd Qu. Max.
-1.80006 -0.18232 -0.09683 -0.17763 -0.06048 0.01493
- Range of percentage differences (%) (diff / rmse_w_d):
Min. 1st Qu. Median Mean 3rd Qu. Max.
-74.94 -23.41 -13.56 -18.19 -9.10 2.00
9.2.2 Observations (Charts)
9.2.3 Result Table
Notes:
- depth = max_depth
- rsamp = subsample
- csamp = colsample_bytree
- round = round
- rmse_w_d = trimmed mean RMSE of weekly-to-daily models’ predictions
- rmse_d_d = trimmed mean RMSE of daily-to-daily models’ predictions
- diff = rmse_d_d - rmse_w_d (i.e. negative differences mean the daily models are better)
- p_diff = diff / rmse_w_d * 100, percentage difference (%)
10 Conclusions
A grid search (300 combinations of different xgboost parameters) was used for this experiment.
Pairs of weekly and daily models (trained using the same parameters) were used to produce out-of-bag predictions on the same weekly test data as well as the same daily test data from 2023-01-01.
Daily models do not improve out-of-bag predictive performance on weekly neutral targets, indicating the Full Daily Data might not be suitable for the classic YIEDL Neutral competition in the current weekly tournament format. ❌
Daily models do show better out-of-bag predictive performance on daily neutral targets, indicating the Full Daily Data is suitable for the daily submission (i.e. the current Numerai Crypto format). ✔️
Daily models do show better out-of-bag predictive performance on weekly updown targets, indicating the Full Daily Data is suitable for the classic YIEDL Updown competition in the current weekly tournament format. ✔️
Daily models do show better out-of-bag predictive performance on daily updown targets. Yet, this might not be helpful for Numerai Crypto as their targets are similar to the normalised neutral targets. 🤷
10.1 Summary (X-to-Weekly, Target Neutral)
- Only a few (5%) daily models show HIGHER mean Spearman correlation compared to weekly models ❌
- Most (62%) daily models show HIGHER (average 0.8% increase in) Sharpe ratio compared to weekly models, indicating better performance. ✔️
- Only a few (5%) daily models show HIGHER compound return compared to weekly models. ❌
10.2 Summary (X-to-Daily, Target Neutral)
- Most (72%) daily models show HIGHER (average 0.5% increase in) mean Spearman correlation compared to weekly models, indicating better performance. ✔️
- Most (62%) daily models show HIGHER (average 0.8% increase in) Sharpe ratio compared to weekly models, indicating better performance. ✔️
- Most (72%) daily models show HIGHER (average 1.1% increase in) compound return compared to weekly models, indicating better performance. ✔️
10.3 Summary (X-to-Weekly, Target Updown)
- All (100%) daily models show LOWER (average 19% decrease in) trimmed mean RMSE compared to weekly models, indicating better performance. ✔️
10.4 Summary (X-to-Daily, Target Updown)
- Nearly all (99.7%) daily models show LOWER (average 18% decrease in) trimmed mean RMSE compared to weekly models, indicating better performance. ✔️